The datasets are from Kaggle and FRED.
In the graph, the bars show the total gross of Broadway shows each year from 1985 to 2019, and the line shows the average price of the shows over the same years. Both gross and average price clearly increase over the 35 years, and the average price in 2019 is more than twice that of 1985. However, we cannot conclude that the growth in gross comes only from the increase in average price.
Seats sold per year is another main factor that influences Broadway gross. The bar chart below shows the total seats sold each year from 1985 to 2019. It indicates that total seats sold increased over the years, but not as much as the average price did.
Total seats and per capita personal income (CPI) of New York
What affects the total seats sold on Broadway? This part aims to determine whether income affects ticket purchases for Broadway shows.
x: per capita personal income (CPI); y: total seats sold per year
Because prices have been rising over the years, gross mixes price and volume, so total seats sold per year is a more suitable measure of purchasing.
The scatter plot and the fitted line show that total seats sold has a positive relationship with per capita personal income.
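As a rough illustration of how such a fitted line can be obtained, the following Python sketch computes an ordinary least squares fit by hand. It is not the code behind the plot: the function name `fit_line` and the numbers in `income` and `seats` are hypothetical placeholders, not the Kaggle or FRED data.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical placeholder values, NOT the report's data.
income = [20.0, 30.0, 40.0, 50.0, 60.0]  # per capita personal income (thousands)
seats = [7.0, 8.5, 9.0, 11.0, 12.5]      # total seats sold (millions)

slope, intercept = fit_line(income, seats)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A positive slope corresponds to the upward-sloping fit line described above; it says nothing by itself about causation.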
According to the bar chart, there is no significant seasonal pattern in any year.
From this part on, we use text visualization to analyze the synopses of all Broadway shows in more depth. First, let’s look at the words used most often across all synopses. The top three are ‘musical’, ‘broadway’ and ‘new’, followed by ‘music’, ‘life’, ‘love’, ‘man’, ‘winner’, ‘songs’ and so on.
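The counting step behind such a ranking can be sketched in a few lines of Python. This is only a minimal illustration: `top_words`, the tiny `STOP_WORDS` set, and the sample `synopses` are all hypothetical and stand in for the real synopsis data and a full stop-word list.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "on", "about"}


def top_words(synopses, n=10):
    """Return the n most frequent non-stop-words across all synopses."""
    counts = Counter()
    for text in synopses:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical synopses, NOT taken from the dataset.
synopses = [
    "A new musical about love and life on Broadway.",
    "The musical is a Broadway story of love.",
    "A new musical.",
]
print(top_words(synopses, 3))
```

Lowercasing before counting is why a proper noun such as Broadway appears as ‘broadway’ in the ranking.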
In this part, we use the Hu & Liu dictionary to calculate each word’s positive and negative score. The words are then divided into two groups, positive and negative, based on their scores. Again, we use a wordcloud to show the differences between the two groups.
Positive words clearly outnumber negative words; as one would expect, shows use more positive words to describe their plots. We also find that the Hu & Liu dictionary assigns many seemingly neutral words to one of the two groups, which makes the difference between them less obvious.
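A dictionary-based split of this kind can be sketched as follows. The two small word sets below are a hypothetical stand-in for the Hu & Liu lexicon, and `split_by_sentiment` is an illustrative helper name, so the output only demonstrates the mechanics.

```python
from collections import Counter

# Hypothetical mini-lexicon standing in for the Hu & Liu opinion lexicon.
POSITIVE = {"love", "winner", "beautiful", "joy"}
NEGATIVE = {"death", "lost", "tragic", "fear"}


def split_by_sentiment(tokens):
    """Count tokens that appear in the positive or the negative word list."""
    pos = Counter(t for t in tokens if t in POSITIVE)
    neg = Counter(t for t in tokens if t in NEGATIVE)
    return pos, neg


# Hypothetical sentence, NOT taken from the dataset.
tokens = "a beautiful story of love lost and love found".split()
pos, neg = split_by_sentiment(tokens)
print("positive:", dict(pos))
print("negative:", dict(neg))
```

Words that appear in neither list are simply dropped, so the two resulting groups depend entirely on what the dictionary chooses to include.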
How do the words appearing in the synopses of 1980s shows and 2010s shows differ in frequency? We select the synopses of shows from 1985 to 1988 and of shows from 2017 to 2020, then compare the words used in these two groups to see whether word usage has changed greatly over time.
Interestingly, the result shows that 2010s shows use the words the two groups have in common more often than 1980s shows do. In other words, 2010s shows are more inclined to repeat the same words than 1980s shows.
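One possible way to put a number on this comparison is the share of each period's tokens that belong to the vocabulary common to both periods. The sketch below is an assumption about how this could be measured, not the method used for the plot; `shared_word_share` and the sample `records` are hypothetical. Years are selected by set membership so that each year is checked against the whole list.

```python
from collections import Counter

EARLY_YEARS = {1985, 1986, 1987, 1988}
LATE_YEARS = {2017, 2018, 2019, 2020}


def shared_word_share(records):
    """records: iterable of (year, tokens). Returns (early_share, late_share),
    the fraction of each period's tokens that are words used in both periods."""
    early, late = Counter(), Counter()
    for year, tokens in records:
        if year in EARLY_YEARS:
            early.update(tokens)
        elif year in LATE_YEARS:
            late.update(tokens)
    shared = set(early) & set(late)

    def share(counts):
        total = sum(counts.values())
        return sum(counts[w] for w in shared) / total if total else 0.0

    return share(early), share(late)


# Hypothetical tokenized synopses, NOT taken from the dataset.
records = [
    (1985, ["musical", "comedy", "revue"]),
    (1988, ["drama", "love", "war"]),
    (2017, ["musical", "love", "musical"]),
    (2019, ["love", "musical", "story"]),
    (2000, ["ignored"]),  # outside both periods, so skipped
]
early_share, late_share = shared_word_share(records)
print(f"early={early_share:.2f}, late={late_share:.2f}")
```

A higher share for the later period would match the observation that 2010s synopses lean more heavily on the common words.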
We find that the top 10 theatres on Broadway have hardly changed from 1990 to 2020, while the weekly gross has continued to increase over time.